wd = "/Users/nabibahmed/Desktop/Local Spring 2021/STAT215/Coursework/HW2"
setwd(wd)
In this HW, we will evaluate differentially expressed genes and pathways between breast cancer and normal breast tissues. Our collaborator generated RNA-seq data on ten pairs of breast tumors and matched adjacent normal tissues, located at /n/stat115/2021/HW2/raw_data1. The experiments were run in two batches, each batch with 5 pairs of samples, and the batch information is provided in batch.csv. We have run Salmon for gene quantification, with results provided on Cannon at /n/stat115/2021/HW2/raw_data1/Salmon_results. The mapping between Ensembl IDs and gene symbols is provided in tx2gene.csv.
Please install the following R/Bioconductor packages. Then try “library(package)” to make sure the package works.
Note: the sva package (with the ComBat function) is used for batch effect removal;
the DESeq2 package is used for differential gene expression analysis;
the tximport package is used for importing transcript-level abundances into gene-level matrices for downstream analysis;
the ggplot2 package is used for general plotting in R;
the pheatmap package is used for heatmap plotting;
the dplyr package is used for data frame manipulation in R;
the fgsea package is used for gene set enrichment analysis.
# ```{r install, eval = FALSE}
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("sva")
# BiocManager::install("DESeq2")
# BiocManager::install("tximport")
# install.packages(c("ggplot2", "dplyr",
# "pheatmap"))
# BiocManager::install("fgsea")
# BiocManager::install("ComplexHeatmap")
library(ggplot2)
library(sva)
## Warning: package 'sva' was built under R version 4.0.3
## Warning: package 'genefilter' was built under R version 4.0.3
## Warning: package 'BiocParallel' was built under R version 4.0.3
library(DESeq2)
## Warning: package 'DESeq2' was built under R version 4.0.3
## Warning: package 'S4Vectors' was built under R version 4.0.3
## Warning: package 'BiocGenerics' was built under R version 4.0.3
## Warning: package 'IRanges' was built under R version 4.0.3
## Warning: package 'GenomicRanges' was built under R version 4.0.3
## Warning: package 'GenomeInfoDb' was built under R version 4.0.3
## Warning: package 'SummarizedExperiment' was built under R version 4.0.3
## Warning: package 'MatrixGenerics' was built under R version 4.0.3
## Warning: package 'Biobase' was built under R version 4.0.3
library(tximport)
## Warning: package 'tximport' was built under R version 4.0.3
library(dplyr)
library(fgsea)
## Warning: package 'fgsea' was built under R version 4.0.3
library(pheatmap)
library(ComplexHeatmap)
## Warning: package 'ComplexHeatmap' was built under R version 4.0.3
# additional libraries
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(caret)
## Loading required package: lattice
library(readr)
##
## Attaching package: 'readr'
## The following object is masked from 'package:genefilter':
##
## spec
library(apeglm)
## Warning: package 'apeglm' was built under R version 4.0.3
library(EnhancedVolcano)
## Warning: package 'EnhancedVolcano' was built under R version 4.0.3
## Loading required package: ggrepel
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
library(ComplexHeatmap)
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:caret':
##
## lift
## The following object is masked from 'package:GenomicRanges':
##
## reduce
## The following object is masked from 'package:IRanges':
##
## reduce
: I successfully installed all the packages in my RStudio and verified each loads with library(). I also included additional libraries.
For RNA-seq analysis, visualization using principal component analysis (PCA) or hierarchical clustering is an essential exploratory step for identifying potential batch effects. Please import transcript-level TPM values from the Salmon output and convert them to gene-level TPM values. Perform PCA of the samples using log2-transformed TPM values. Indicate the tissue (tumor or normal, shown by shape) and batch (1 or 2, shown by color) of each sample. Next, try hierarchical clustering of these samples. Do the results suggest the presence of a batch effect?
For this question, you will load Salmon output at /n/stat115/2021/HW2/raw_data1/Salmon_results. You also need to read in batch information provided in /n/stat115/2021/HW2/raw_data1/batch.csv. Remember to convert Ensembl ID to gene symbol, using the mapping provided in /n/stat115/2021/HW2/raw_data1/tx2gene.csv.
: I used the scp command to transfer Salmon_results, batch.csv, and tx2gene.csv to my local machine. Below is my bash code (run on my local machine).
# transfer Salmon_results to local directory (on local terminal)
scp -r stat215u2107@login.rc.fas.harvard.edu:/n/stat115/2021/HW2/raw_data1/Salmon_results "Q1"
# transfer batch.csv to local directory (on local terminal)
scp stat215u2107@login.rc.fas.harvard.edu:/n/stat115/2021/HW2/raw_data1/batch.csv "Q1"
# transfer tx2gene.csv to local directory (on local terminal)
scp stat215u2107@login.rc.fas.harvard.edu:/n/stat115/2021/HW2/raw_data1/tx2gene.csv "Q1"
I load in my data files. I use tximport, which takes my Salmon results and the tx2gene CSV and returns a simple list with matrices “abundance”, “counts”, and “length”, in which the transcript-level information is summarized to the gene level. I also rename the columns accordingly.
### Your code here
# read data files
batch = read.csv("Q1/batch.csv")
tx2gene = read.csv("Q1/tx2gene.csv")
# using txi to read in all transcript data
# convert transcript-level TPM to gene-level TPM values (using tximport package)
# A simple list with matrices, "abundance", "counts", and "length", is returned, where the transcript level information is summarized to the gene-level
# Source: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html
salmon_files = paste("Q1/Salmon_results/", batch$X, ".sf", sep="")
txi = tximport(salmon_files, type = "salmon", tx2gene = tx2gene)
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## summarizing abundance
## summarizing counts
## summarizing length
# labeling columns
colnames(txi[["abundance"]]) = batch$X
colnames(txi[["counts"]]) = batch$X
colnames(txi[["length"]]) = batch$X
I perform a log2(TPM+1) transform on the TPM data to resolve issues with zero values (as instructed in the prompt and on Slack) and negative values for the comparison in Q1.3. I also remove zero-variance genes from my dataset.
# getting the gene-level TPM values (tximport has already summarized transcripts to genes)
TPM = txi[["abundance"]]
TPM.df = data.frame(TPM)
# Taking log2(TPM+1) transformation (resolves issue with 0s and negatives)
logTPM = log2(TPM+1)
logTPM.df = data.frame(t(logTPM))
# remove zero-variance genes (using nearZeroVar from caret)
logTPM.df = logTPM.df[,-nearZeroVar(logTPM.df)]
Using prcomp, I perform PCA on the logTPM values. I make the scree plot and see the first two principal components account for ~42.3% of the total variance.
# Perform PCA of the samples using log2 transformed TPM values & making scree plot
res.pca = prcomp(logTPM.df, scale = TRUE, center = TRUE)
pca.var.per = round(res.pca$sdev^2/sum(res.pca$sdev^2)*100,1)
fviz_eig(res.pca, addlabels=TRUE, ylim=c(0,100),
geom=c("bar", "line"), barfill="gold",
barcolor="grey", linecolor="red", ncp=10) +
labs(title="PCA Coverage",
x="Principal Components", y="% of variances")
Using the first two principal components, I plot the samples, using shape for tissue type and color for batch.
# making pca plot data frame
pca_plot_data = data.frame(sample = rownames(res.pca$x),
X = res.pca$x[,1],
Y = res.pca$x[,2],
batch = batch$batch,
tissue = batch$tissue)
# plot PCA graph
# shapes for tissues, colors for batch
ggplot(data=pca_plot_data, aes(x=X, y=Y, shape=tissue, color=factor(batch))) +
xlab(paste("PC1: ",pca.var.per[1],"%",sep ="")) +
ylab(paste("PC2: ",pca.var.per[2],"%",sep ="")) +
geom_point(size=7) +
ggtitle("PCA Graph (PC1 vs PC2)")
From the PCA graph above, we see that in batch 1 the tumor and normal tissues fall on opposite sides (tumor on the right, normal on the left), whereas in batch 2 the tumor and normal tissues occupy the same region. I would expect the relationship between tumor and normal tissues to be consistent across batches - given that the two batches show different patterns, this PCA plot suggests a potential batch effect.
Now let’s try hierarchical clustering of these samples to check for batch effects in the logTPM values.
# making and scaling logTPM values
hc.logTPM = scale(logTPM.df)
# hierarchical clustering using euclidean
dist.hc.logTPM = dist(hc.logTPM, method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(dist.hc.logTPM, method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)
The samples N1-5 and T1-5 are batch 2; the samples N6-10 and T6-10 are batch 1. The dendrogram shows that the batch 2 samples all cluster together despite being different tissue types - the clustering follows batch rather than tissue type, as we would otherwise expect. Again this seems odd, and there appears to be a potential batch effect.
# plotting correlation heatmap
pheatmap(cor(logTPM))
Following up with the pheatmap correlation heatmap, I see further evidence of a batch effect - the samples within batch 2 (T1-5 and N1-5) are more correlated with each other than tumor tissues are across batches or normal tissues are across batches, which is the opposite of what we should expect.
Run COMBAT on the samples to remove the batch effect. Visualize the results using a similar PCA and hierarchical clustering as Problem 2. Provide evidence that the batch effects are successfully adjusted.
: I ran ComBat on the TPM matrix (per instructions on Slack). I first need to filter out genes with zero variance - I used the nearZeroVar function from caret.
### Your code here
# remaking TPM dataframe for consistency
q2.TPM = txi[["abundance"]]
q2.TPM.df = data.frame(t(q2.TPM))
# removing zero variance genes
q2.TPM.df = q2.TPM.df[, -nearZeroVar(q2.TPM.df)]
# performing combat analysis
combat_dat = ComBat(t(q2.TPM.df), batch$batch, par.prior=TRUE)
## Found 345 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
## Found 2 batches
## Adjusting for 0 covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
Then I apply a log2(combat_dat+1) transform (to handle zero and negative values, as in Q1.2) and run the PCA, hierarchical clustering, and pheatmap analyses on the result.
# Taking log2(combat_dat+1) transformation (resolves issue with 0s and negatives)
combat.logTPM = log2(combat_dat+1)
combat.logTPM.df = data.frame(t(combat.logTPM))
Using prcomp, I perform PCA on the post-ComBat logTPM values. From the scree plot, the first two principal components account for ~38.8% of the total variance.
# Perform PCA of the samples using log2 transformed combat TPM values & making scree plot
res.pca = prcomp(combat.logTPM.df, scale = TRUE, center = TRUE)
pca.var.per = round(res.pca$sdev^2/sum(res.pca$sdev^2)*100,1)
fviz_eig(res.pca, addlabels=TRUE, ylim=c(0,100),
geom=c("bar", "line"), barfill="gold",
barcolor="grey", linecolor="red", ncp=10) +
labs(title="PCA Coverage",
x="Principal Components", y="% of variances")
Using the first two principal components, I plot the samples post-ComBat, again using shape for tissue type and color for batch.
# making pca plot data frame
pca_plot_data = data.frame(sample = rownames(res.pca$x),
X = res.pca$x[,1],
Y = res.pca$x[,2],
batch = batch$batch,
tissue = batch$tissue)
# plot PCA graph
# shapes for tissues, colors for batch
ggplot(data=pca_plot_data, aes(x=X, y=Y, shape=tissue, color=factor(batch))) +
xlab(paste("PC1: ",pca.var.per[1],"%",sep ="")) +
ylab(paste("PC2: ",pca.var.per[2],"%",sep ="")) +
geom_point(size=7) +
ggtitle("PCA Graph (PC1 vs PC2) - Post ComBat")
From the PCA graph above, batch 1 and batch 2 now appear much more similar: the tumor tissues from both batches occupy the same region of the plot, and likewise for the normal tissues. The distinction between tissue types is much more visible (as we would expect)! This shows that ComBat helped address the batch effect, since the normal-versus-tumor pattern is now consistent across the two batches (unlike in Q1.2).
Now let’s try hierarchical clustering of these samples to check for batch effects in the post-ComBat logTPM values.
# making and scaling logTPM values
combat.hc.logTPM = scale(combat.logTPM.df)
# hierarchical clustering using euclidean
combat.dist.hc.logTPM = dist(combat.hc.logTPM, method = "euclidean")
# Hierarchical clustering using Complete Linkage
combat.hc1 <- hclust(combat.dist.hc.logTPM, method = "complete" )
# Plot the obtained dendrogram
plot(combat.hc1, cex = 0.6, hang = -1)
The samples N1-5 and T1-5 are batch 2; the samples N6-10 and T6-10 are batch 1. The dendrogram now looks as we would expect, with clustering driven by tissue type (tumor vs. normal) rather than batch. Normal samples cluster with each other before joining a group containing tumor samples, and vice versa for tumor samples: for example, N1-5 and T1-5 each form their own clusters before merging, and similarly on the right, T6-10 and N6-10 each cluster before joining the other tissue type.
# plotting correlation heatmap
pheatmap(cor(combat.logTPM))
Following up with the pheatmap correlation heatmap, I see further evidence that ComBat addressed the batch effect! The highest correlations are now between tissues of the same type - unlike in Q1.2, where the highest correlations were among samples within batch 2.
Run DESeq2 on the paired samples, adjusting for the batch effect, to identify differentially expressed genes between tumor and normal tissues. How many genes are expressed higher in tumors than in normal tissue? Let’s use 1) FDR < 0.01 and 2) Log2 fold change > 1 as the cutoffs.
Note: please use the raw_count columns of the Salmon result and convert these to integer values for DESeq2.
Identify the top 5 most (by Log2FC) over expressed genes (FDR < 0.01) in tumor and normal, respectively.
: As per a Slack message, I can use the counts from the txi object. For my design formula, I specified batch then condition (tissue). Also per Slack, the FDR is the padj column in the res data frame below.
### Your code here
# converting numeric to factor for design matrix
ddsBatch = batch
ddsBatch$X = as.factor(ddsBatch$X)
ddsBatch$patient = as.factor(ddsBatch$patient)
ddsBatch$batch = as.factor(ddsBatch$batch)
ddsBatch$tissue = as.factor(ddsBatch$tissue)
# performing DESeq2 (lab 3 code)
ddsTxi = DESeqDataSetFromTximport(txi,
colData = ddsBatch,
design = ~ batch + tissue)
## using counts and average transcript lengths from tximport
dds <- DESeq(ddsTxi)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)
resultsNames(dds)
## [1] "Intercept" "batch_2_vs_1" "tissue_Tumor_vs_Normal"
# converting to data frame
res.df = data.frame(res@listData)
rownames(res.df) = res@rownames
# Filter based on criterion (1): FDR < 0.01
res.df = na.omit(res.df[res.df$padj<0.01,])
# Filter based on criterion (2): |log2FoldChange| > 1
res.df = na.omit(res.df[abs(res.df$log2FoldChange)>1,])
# Get top 5 and bottom 5 by Log2FC
res.df = res.df[order(res.df$log2FoldChange),]
head(res.df, 5)
## baseMean log2FoldChange lfcSE stat pvalue
## FABP7 170.14847 -8.827787 0.8583438 -10.28468 8.262030e-25
## UGT2B28 89.77218 -8.525993 1.9342661 -4.40787 1.043924e-05
## SOX10 1333.74853 -7.749225 0.7417718 -10.44691 1.513719e-25
## SMYD1 172.44383 -7.534316 0.5875541 -12.82319 1.215998e-37
## KRT14 14674.46105 -7.487382 0.7151574 -10.46956 1.191980e-25
## padj
## FABP7 1.989313e-22
## UGT2B28 1.515338e-04
## SOX10 3.826934e-23
## SMYD1 1.152842e-34
## KRT14 3.064601e-23
tail(res.df, 5)
## baseMean log2FoldChange lfcSE stat pvalue padj
## SLC30A8 12449.73564 7.709832 1.155675 6.671280 2.535816e-11 1.582954e-09
## CLEC3A 6301.45669 7.870149 1.300904 6.049756 1.450656e-09 6.963606e-08
## HOXB13 83.22289 8.083036 1.387366 5.826176 5.671188e-09 2.486308e-07
## TRPA1 5873.64842 9.259529 1.139270 8.127596 4.378883e-16 4.677695e-14
## CASP14 145.43763 10.421588 2.875133 3.624732 2.892610e-04 2.172179e-03
Since our contrast is tissue_Tumor_vs_Normal, negative log2FoldChange values correspond to genes higher in normal tissue. Thus, the 5 most over-expressed genes in normal tissue are FABP7, UGT2B28, SOX10, SMYD1, and KRT14, and in tumor tissue CASP14, TRPA1, HOXB13, CLEC3A, and SLC30A8. (If I got the direction wrong, my apologies for misinterpreting.)
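As an aside on the prompt's note about integer counts: DESeqDataSetFromTximport handles Salmon's fractional counts (and average transcript lengths) internally, so no manual conversion was needed above. If one instead built the object by hand from the counts matrix, the counts would need rounding to integers first. A minimal sketch, assuming the txi and batch objects from Q1.1 (illustrative, not the graded solution):

```r
library(DESeq2)

# round fractional Salmon counts to integers, as DESeqDataSetFromMatrix expects
cts = round(txi$counts)
storage.mode(cts) = "integer"

# colData needs factor columns matching the design formula
coldata = data.frame(batch = factor(batch$batch),
                     tissue = factor(batch$tissue))

# same paired design as before, but without tximport's length correction
dds.manual = DESeqDataSetFromMatrix(countData = cts,
                                    colData = coldata,
                                    design = ~ batch + tissue)
```

Note this manual route loses the average-transcript-length offset that DESeqDataSetFromTximport provides.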
Visualize the differential gene expression values by making a volcano and an MA plot to summarize the differences between tumor and normal. In the volcano plot, draw the statistically significant (FDR < 0.01, Log2FC > 1) tumor up-genes red and down-genes blue.
Note: Be sure to use the lfcShrink function to get more robust estimates of the fold-changes for genes.
: Below are the MA plot and volcano plot. For the MA plot, Slack said to use the default type and not to set lfcThreshold. For the volcano plot, I colored points with log2FC > 1 (and FDR < 0.01) red for tumor up-genes and points with log2FC < -1 blue for tumor down-genes.
### Your code here
# MA Plot (from Lab 3)
resApeT = lfcShrink(dds, coef="tissue_Tumor_vs_Normal")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
DESeq2::plotMA(resApeT, ylim=c(-3,3), cex=.8, main="MA Plot")
abline(h=c(-1,1), col="dodgerblue", lwd=2)
# volcano plot
with(resApeT, plot(log2FoldChange, -log10(pvalue),
pch=20, main="Volcano Plot"))
# adding line for Log2FC > 1
abline(v=1, lty=2); abline(v=-1, lty=2)
# adding line for FDR < 0.01
abline(h=-log10(0.01), lty=2)
# tumor up-genes red (log2FC>1)
with(subset(resApeT, padj<.01 & log2FoldChange>1),
points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
# tumor down-genes blue (log2FC<-1)
with(subset(resApeT, padj<.01 & log2FoldChange<(-1)),
points(log2FoldChange, -log10(pvalue), pch=20, col="blue"))
Try kmeans (try k = 4 or 7) clustering to group differentially expressed genes into different clusters. How many genes are there in each cluster? Draw a heatmap showing gene clusters.
: From Slack: for I.6, you should subset the ComBat-corrected matrix by the differentially expressed genes obtained from DESeq2. I will run kmeans (and draw the heatmap) on combat.logTPM.df (the log-transformed ComBat expression matrix) restricted to the genes DESeq2 identified as differentially expressed, so the batch effect is handled.
### Your code here
# subsetting differentially expressed genes from ComBat results
i6.data = subset(t(combat.logTPM.df), rownames(t(combat.logTPM.df)) %in% rownames(res.df))
# removing NAs and scaling
i6.data = scale(na.omit(i6.data))
Trying kmeans with \(k=4\): the cluster sizes are 448, 512, 711, and 356 genes (clusters 1-4).
set.seed(215)
k4 = kmeans(i6.data, centers = 4, nstart = 25)
fviz_cluster(k4, data = i6.data, geom="point")
table(k4$cluster)
##
## 1 2 3 4
## 448 512 711 356
Trying kmeans with \(k=7\): the cluster sizes are 429, 298, 245, 181, 293, 266, and 315 genes (clusters 1-7).
set.seed(215)
k7 = kmeans(i6.data, centers = 7, nstart = 25)
fviz_cluster(k7, data = i6.data, geom="point")
table(k7$cluster)
##
## 1 2 3 4 5 6 7
## 429 298 245 181 293 266 315
Comparing \(k=4\) and \(k=7\), the two split the data fairly similarly. However, I think \(k=4\) looks cleaner and more natural than \(k=7\) (no awkward gaps within a cluster, like the orange group for \(k=7\)). The elbow plot below agrees, with a more prominent bend at \(k=4\) than at \(k=7\); in fact, the sharpest bend is at \(k=2\). However, since \(k=7\) is the magic number from lecture, we’ll stick with that!
set.seed(215)
# function to compute total within-cluster sum of square
wss <- function(k) {
kmeans(i6.data, k, nstart = 25)$tot.withinss
}
# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15
# extract wss for 2-15 clusters
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
I drew a heatmap of samples (x-axis) vs. genes (y-axis) based on the \(k=7\) clusters (the magic number from lecture). I removed the row labels since they were unreadable (all on top of one another). Looking at the heatmap, each cluster has a distinctive pattern, and there is a visible shift as we move from the normal samples to the tumor samples. Thus the heatmap shows a difference in gene expression between tumor and normal tissues, as we’d expect.
set.seed(215)
Heatmap(i6.data, name = "gene expression",
row_km = 7,
column_title = "HeatMap",
column_order = colnames(i6.data),
show_row_names = FALSE)
## `use_raster` is automatically set to TRUE for a matrix with more than
## 2000 rows. You can control `use_raster` argument by explicitly setting
## TRUE/FALSE to it.
##
## Set `ht_opt$message = FALSE` to turn off this message.
## 'magick' package is suggested to install to give better rasterization.
##
## Set `ht_opt$message = FALSE` to turn off this message.
If you run DESeq2 without removing the batch effect, how many differential genes do you get? How does their k-means clustering look? Does batch effect removal give more interpretable results?
: This is the DESeq2 analysis from I.4 without adding batch to the design formula of DESeqDataSetFromTximport.
### Your code here
# performing DESeq2 (lab 3 code)
ddsTxi.i7 = DESeqDataSetFromTximport(txi,
colData = ddsBatch,
design = ~ tissue)
## using counts and average transcript lengths from tximport
dds.i7 <- DESeq(ddsTxi.i7)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 633 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
res.i7 <- results(dds.i7)
resultsNames(dds.i7)
## [1] "Intercept" "tissue_Tumor_vs_Normal"
# converting to data frame
res.df.i7 = data.frame(res.i7@listData)
rownames(res.df.i7) = res.i7@rownames
# Filter based on criterion (1): FDR < 0.01
res.df.i7 = na.omit(res.df.i7[res.df.i7$padj<0.01,])
# Filter based on criterion (2): |log2FoldChange| > 1
res.df.i7 = na.omit(res.df.i7[abs(res.df.i7$log2FoldChange)>1,])
nrow(res.df.i7)
## [1] 1352
If I run DESeq2 without removing the batch effect, I get 1352 differential genes (compared to 2037 differential genes when controlling for batch).
Analyzing their kmeans results for \(k=4\) and \(k=7\): I subset the original (uncorrected) logTPM matrix by these differentially expressed genes.
# subsetting differentially expressed genes from the uncorrected logTPM matrix
i7.data = subset(t(logTPM.df), rownames(t(logTPM.df)) %in% rownames(res.df.i7))
# removing NAs and scaling
i7.data = scale(na.omit(i7.data))
set.seed(215)
k4.i7 = kmeans(i7.data, centers = 4, nstart = 25)
fviz_cluster(k4.i7, data = i7.data, geom="point")
table(k4.i7$cluster)
##
## 1 2 3 4
## 231 461 306 351
k7.i7 = kmeans(i7.data, centers = 7, nstart = 25)
fviz_cluster(k7.i7, data = i7.data, geom="point")
table(k7.i7$cluster)
##
## 1 2 3 4 5 6 7
## 187 234 149 144 222 154 259
The respective cluster sizes for \(k=4\) are 231, 461, 306, and 351; for \(k=7\) they are 187, 234, 149, 144, 222, 154, and 259. The clusters for \(k=4\) and \(k=7\) look similar, and the kmeans algorithm seems able to find clusters of genes despite the batch effect.
Thus, judging by the kmeans cluster plots, I would say that batch effect removal doesn’t give visibly better results here.
set.seed(215)
Heatmap(i7.data, name = "gene expression",
row_km = 7,
column_title = "HeatMap",
column_order = colnames(i7.data),
show_row_names = FALSE)
For the heatmap, I could not make out a difference between this one and the heatmap from I.6 - they looked fairly similar.
From the batch-removed DESeq2 run results, extract the top 200 tumor-upregulated genes (by Log2FC, FDR < 0.01). Run DAVID GO analysis (http://david.abcc.ncifcrf.gov/) to see whether these genes are enriched in specific biological processes (BP), pathways, etc.
: Upregulated genes are those with log2FoldChange > 1. We already sorted our genes by log2FoldChange above, so we pull the last 200 rows.
# export the list of genes (write_tsv's `path` argument is deprecated; use `file`)
write_tsv(data.frame(rownames(tail(res.df, 200))),
file = "i.8.txt")
Below is the enrichment table from David. Slack said to just post a screenshot.
Run Gene Set Enrichment Analysis (http://www.broadinstitute.org/gsea/index.jsp) using the summary statistics from Problem 4. Show the top five gene sets or pathways that best capture the differentially expressed genes between tumor and normal. Comment on the biological relevance of the results. Plot a GSEA enrichment plot for an interesting pathway.
Mapping between gene sets and pathways is provided in /n/stat115/2021/HW2/raw_data1/c2.cp.kegg.v7.1.symbols.gmt file.
: I copied the .gmt file over via scp.
# transfer mapping to local directory (on local terminal)
scp stat215u2107@login.rc.fas.harvard.edu:/n/stat115/2021/HW2/raw_data1/c2.cp.kegg.v7.1.symbols.gmt "Q1"
I rank my DESeq2 results by log2 fold change and make a bar plot of the ranks.
# ranking the differentially expressed genes (filtered DESeq2 results) by log2 fold change
gseaDat = res.df
ranks = res.df$log2FoldChange
names(ranks) = rownames(res.df)
barplot(sort(ranks, decreasing = T))
I import the pathway file and reshape it into the form fgsea expects for GSEA analysis.
# loading in pathways data and correcting form
pathways = read.delim("Q1/c2.cp.kegg.v7.1.symbols.gmt",
header = FALSE, stringsAsFactors = FALSE, quote = "", sep = "\t")
pathways = subset(pathways, select = -V2)
pathways = t(pathways)
colnames(pathways) = pathways[1,]
pathways = pathways[-1,]
pathways = data.frame(pathways)
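As an aside, the fgsea package also ships a parser for .gmt files, gmtPathways(), which returns the named list of character vectors (one per gene set) that fgsea() expects; it could replace the manual reshaping above (same file path assumed):

```r
library(fgsea)

# gmtPathways() returns a named list: one character vector of gene symbols per gene set
pathways_list = gmtPathways("Q1/c2.cp.kegg.v7.1.symbols.gmt")
```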
I print the top five gene sets or pathways that best capture the differentially expressed genes. They are KEGG_CELL_CYCLE, SFN, KEGG_RETINOL_METABOLISM, KEGG_PPAR_SIGNALING_PATHWAY, and KEGG_ECM_RECEPTOR_INTERACTION.
Using the Broad Institute website, I saw that KEGG_CELL_CYCLE relates to mitotic cell cycle progression - this makes sense, as tumors involve dysregulated cell division. The KEGG_PPAR_SIGNALING_PATHWAY and KEGG_ECM_RECEPTOR_INTERACTION pathways also relate to cell proliferation, which again makes sense for tumor data (since tumors involve uncontrolled cell production).
# running GSEA
fgseaRes <- fgsea(pathways, ranks, minSize=15, maxSize = 500, nperm=1000)
## Warning in fgsea(pathways, ranks, minSize = 15, maxSize = 500, nperm = 1000):
## You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To
## run fgseaMultilevel, you need to remove the nperm argument in the fgsea function
## call.
head(fgseaRes[order(padj, -abs(NES)), ], n=5)
## pathway pval padj ES NES
## 1: KEGG_CELL_CYCLE 0.002087683 0.005353319 0.6613172 2.621403
## 2: SFN 0.002141328 0.005353319 0.6536100 2.244823
## 3: KEGG_RETINOL_METABOLISM 0.001919386 0.005353319 -0.6167586 -2.203804
## 4: KEGG_PPAR_SIGNALING_PATHWAY 0.001869159 0.005353319 -0.5677312 -1.995799
## 5: KEGG_ECM_RECEPTOR_INTERACTION 0.252676660 0.448598131 0.3444287 1.182940
## nMoreExtreme size leadingEdge
## 1: 0 23 CDC25C,SMC1B,CDC45,CDC20,BUB1,PLK1,...
## 2: 0 15 CCNE2,PKMYT1,BUB1B,ESPL1,TTK,CCNB2,...
## 3: 0 16 UGT2B28,UGT2B11,ADH4,RDH5,ADH1B,ADH1C,...
## 4: 0 15 FABP7,PCK1,PLIN1,FABP4,ADIPOQ,LPL,...
## 5: 117 15 IBSP,COL2A1,SPP1,COMP,FN1,COL5A2,...
I plot the GSEA enrichment plot for the KEGG_CELL_CYCLE pathway.
plotEnrichment(pathways[["KEGG_CELL_CYCLE"]], ranks)
We provide you z-score normalized expression data of 50 breast tumor samples, 50 normal breast samples (your training and cross-validation data), and 20 samples without diagnosis (your testing data). We want to use the 100 samples with known diagnosis to train machine learning models in order to predict the 20 unknown samples.
You will need the following libraries in R: ggplot2 and ggfortify for plotting, MASS and caret for machine learning, and pROC for evaluating testing performance. The YouTube video on caret (https://youtu.be/z8PRU46I3NY) and the package documentation (http://topepo.github.io/caret/index.html) might be helpful.
All data for Part II are provided at /n/stat115/2021/HW2/raw_data2.
library(ggplot2)
library(ggfortify)
library(pROC)
library(caret)
# additional libraries
library(e1071)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:genefilter':
##
## area
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
##
## expand
## Loaded glmnet 4.1-1
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:BiocGenerics':
##
## type
## The following object is masked from 'package:ggplot2':
##
## alpha
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:Biobase':
##
## combine
## The following object is masked from 'package:BiocGenerics':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(ROCR)
# transfer data to local directory (on local terminal)
scp -r stat215u2107@login.rc.fas.harvard.edu:/n/stat115/2021/HW2/raw_data2 "Q2"
# read in data
train_label = read.delim("Q2/raw_data2/BRCA_phenotype.txt",
header = TRUE, stringsAsFactors = FALSE, quote = "", sep = "\t")
train_data = read.delim("Q2/raw_data2/BRCA_zscore_data.txt",
header = TRUE, stringsAsFactors = FALSE, quote = "", sep = "\t")
test_label = read.delim("Q2/raw_data2/diagnosis.txt",
header = TRUE, stringsAsFactors = FALSE, quote = "", sep = "\t")
test_data = read.delim("Q2/raw_data2/unknown_samples.txt",
header = TRUE, stringsAsFactors = FALSE, quote = "", sep = "\t")
Run PCA for dimension reduction on the 100 samples with known labels, and draw these 100 samples in a 2D plot. Do cancer and normal samples separate in the first two PCs? Would this be sufficient to classify the unknown samples?
z-score normalized data are provided in BRCA_zscore_data.txt. Phenotype data is in BRCA_phenotype.txt.
: We have 100 samples (rows) and many gene variables (columns).
### Your code here
# pca with prcomp command
res.pca = prcomp(train_data, scale = TRUE, center = TRUE)
# scree plot
pca.var.per = round(res.pca$sdev^2/sum(res.pca$sdev^2)*100,1)
fviz_eig(res.pca, addlabels=TRUE, ylim=c(0,100), geom = c("bar", "line"),
barfill = "gold", barcolor="grey", linecolor = "red", ncp=10) +
labs(title = "PCA Coverage",
x = "Principal Components", y = "% of variances")
Based on the scree plot, the first two principal components capture 60.9% of the variance.
# making pca plot data frame
pca_plot_data = data.frame(sample = rownames(res.pca$x),
X = res.pca$x[,1],
Y = res.pca$x[,2])
pca_plot_data = merge(pca_plot_data, train_label, by="sample")
# making pca plot
ggplot(data = pca_plot_data, aes(x=X, y=Y, label=sample, color=phenotype)) +
geom_point() +
xlab(paste("PC1: ",pca.var.per[1],"%",sep ="")) +
ylab(paste("PC2: ",pca.var.per[2],"%",sep ="")) +
theme_bw() +
ggtitle("PCA Graph")
From the PCA graph, cancer and normal tissue separate reasonably well on the first two PCs - there are visible clusters for the two tissue types. However, the two clusters sit close to one another, so it may not be clear which tissue type a new sample belongs to if it falls near the boundary between the red and blue points. I would therefore not consider the first two PCs sufficient on their own to classify the unknown samples.
Draw a plot showing the cumulative % variance captured from the top 100 PCs. How many PCs are needed to capture 90% of the variance?
: I made a cumulative % variance captured plot. I would need 25 PCs to capture 90% of the variance.
### Your code here
pca.var = res.pca$sdev^2/sum(res.pca$sdev^2)
pca.var.acc = cumsum(pca.var)
min(which(pca.var.acc > 0.9))
## [1] 25
plot(pca.var.acc, main="Cumulative Proportion of Variance Captured",
xlab="PC", ylab="Cumulative Proportion of Variance")
abline(h=0.9, col="red", lty=2)
abline(v=25, col="blue", lty=3)
Apply machine learning methods (KNN, logistic regression, Ridge regression, LASSO, ElasticNet, random forest, and support vector machines) on the top 25 PCs of the training data and 5-fold cross validation to classify the samples. caret and MASS already implemented all of the machine learning methods, including cross-validation, so calling each is only one command. In order to get consistent results from different runs, use set.seed().
: As per Slack, KNN and SVM need scaled inputs since they use Euclidean distances. For the other models, I used caret and specified the models based on Lab and Slack.
### Your code here
library(caret)   # train, preProcess, trainControl, confusionMatrix
set.seed(215)
# keep the top 25 PCs
train_df = data.frame(res.pca$x[, 1:25])
train_outcome = data.frame(sample = rownames(train_df))
train_outcome = merge(train_outcome, train_label, by="sample")
train_outcome$phenotype = as.factor(train_outcome$phenotype)
# preProcessing for algorithms using Euclidean distances
scaled_train_df = predict(preProcess(train_df, method = c("center", "scale")),
train_df)
# fixed parameters
control = trainControl(method = "cv",
number = 5,
savePredictions = TRUE,
classProbs = TRUE)
metric = "Accuracy"
lambda = 10^seq(-3, 3, length = 100)
# KNN
knn = train(x = scaled_train_df,
y = train_outcome$phenotype,
method = "knn",
metric = metric,
trControl = control,
tuneGrid = expand.grid(k=seq(1,10,1)))
# Logistic Regression
# https://stackoverflow.com/questions/39550118/cross-validation-function-for-logistic-regression-in-r
lr = train(x = train_df,
y = train_outcome$phenotype,
method = "glm",
family = binomial(),
metric = metric,
trControl = control)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Ridge Regression
ridge = train(x = train_df,
y = train_outcome$phenotype,
method = "glmnet",
trControl = control,
tuneGrid = expand.grid(alpha = 0, lambda = lambda))
# LASSO
lasso = train(x = train_df,
y = train_outcome$phenotype,
method = "glmnet",
trControl = control,
tuneGrid = expand.grid(alpha = 1, lambda = lambda))
# ElasticNet
elastic = train(x = train_df,
y = train_outcome$phenotype,
method = "glmnet",
trControl = control,
tuneLength = 10)
# Random Forest
rf = train(x = train_df,
y = train_outcome$phenotype,
method = "rf",
trControl = control,
tuneGrid = expand.grid(mtry=seq(1,15,1)))
# SVM
svm = train(x = train_df,
y = train_outcome$phenotype,
method = "svmLinear",
trControl = control,
tuneGrid = expand.grid(C=c(0.01, 0.1, 1, 10)))
Summarize the performance of each machine learning method, in terms of accuracy and kappa.
: Below is a table summarizing my results. For models with hyperparameters, I report only the hyperparameters that produced the best model, along with that model's accuracy and kappa.
From this analysis of kappa and accuracy, the best performing model was ElasticNet with the highest accuracy (0.9704762) and highest kappa (0.9409955).
### Your code here
# KNN summary
knn_best = knn[["bestTune"]]
knn_results = knn[["results"]]
knn_results[knn_results$k == knn_best$k, ]
## k Accuracy Kappa AccuracySD KappaSD
## 1 1 0.76 0.5219608 0.05477226 0.1091842
# Logistic summary
lr[["results"]]
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 0.83 0.659596 0.1036822 0.2070251
# Ridge summary
ridge_best = ridge[["bestTune"]]
ridge_results = ridge[["results"]]
ridge_results[ridge_results$lambda == ridge_best$lambda, ]
## alpha lambda Accuracy Kappa AccuracySD KappaSD
## 27 0 0.03764936 0.9504762 0.9009955 0.03537137 0.07074571
# Lasso summary
lasso_best = lasso[["bestTune"]]
lasso_results = lasso[["results"]]
lasso_results[lasso_results$lambda == lasso_best$lambda, ]
## alpha lambda Accuracy Kappa AccuracySD KappaSD
## 24 1 0.02477076 0.9699499 0.9398274 0.02748912 0.05502383
# Elastic summary
elastic_best = elastic[["bestTune"]]
elastic_results = elastic[["results"]]
elastic_results[elastic_results$lambda == elastic_best$lambda &
elastic_results$alpha == elastic_best$alpha, ]
## alpha lambda Accuracy Kappa AccuracySD KappaSD
## 26 0.3 0.01163312 0.9704762 0.9409955 0.04446711 0.08891236
# Random Forest summary
rf_best = rf[["bestTune"]]
rf_results = rf[["results"]]
rf_results[rf_results$mtry == rf_best$mtry,]
## mtry Accuracy Kappa AccuracySD KappaSD
## 3 3 0.9499499 0.8995929 0.03539979 0.07080576
# SVM summary
svm_best = svm[["bestTune"]]
svm_results = svm[["results"]]
svm_results[svm_results$C == svm_best$C,]
## C Accuracy Kappa AccuracySD KappaSD
## 1 0.01 0.9304261 0.8604734 0.08312297 0.1664324
Compare the performance difference between logistic regression, Ridge, LASSO, and ElasticNet. In LASSO, how many PCs have non-zero coefficient? In ElasticNet, what is the lambda for Ridge and LASSO, respectively?
: In logistic regression, the coefficients are fit to maximize the likelihood of the observed data. Ridge, LASSO, and ElasticNet add a penalty term to the regression to reduce variance and address multicollinearity (correlations between predictor variables). The effect is that they shrink the coefficients on the predictors. The table of coefficients below demonstrates this effect - the coefficients in Ridge, LASSO, and ElasticNet are much smaller than those in the logistic model.
Ridge regression adds a squared (L2) penalty, so the coefficients asymptotically approach zero but never reach it. Thus, even though the Ridge coefficients have shrunk, none of them are exactly zero. Ridge regression is useful when most predictors are informative, but as we saw in the scree plot of the PCA, the first few PCs account for most of the variation, so I would not consider Ridge the best model choice here.
LASSO regression adds an absolute-value (L1) penalty, so the coefficients can shrink all the way to zero. In the LASSO fit, 16 PCs have a coefficient of exactly zero (meaning 9 PCs have non-zero coefficients). LASSO works well when there are a few important predictors and many uninformative ones.
ElasticNet regression combines the Ridge and LASSO penalties, and it had the highest accuracy and kappa of all the models. The best ElasticNet hyperparameters were \(\alpha=0.3, \lambda = 0.01163312\); per the glmnet documentation, \(\lambda\) is split between the two penalties by the mixing parameter \(\alpha\), so the LASSO (L1) penalty receives weight \(\alpha\lambda \approx 0.0035\) and the Ridge (L2) penalty weight \((1-\alpha)\lambda/2 \approx 0.0041\). Coefficients can still shrink to zero in ElasticNet (we see some zero coefficients, though fewer than in LASSO). It is well suited to situations where the predictors are correlated, which makes sense here with gene expression data.
### Your code here
# extracting coefficients from best models
lr_coef = coef(lr$finalModel)
ridge_coef = coef(ridge$finalModel, ridge$bestTune$lambda)
lasso_coef = coef(lasso$finalModel, lasso$bestTune$lambda)
elastic_coef = coef(elastic$finalModel, elastic$bestTune$lambda)
# printing out table of coefficients
coefs = data.frame(as.matrix(cbind(lr_coef, ridge_coef, lasso_coef, elastic_coef)))
colnames(coefs) = c("Logistic", "Ridge", "Lasso", "Elastic")
print(coefs)
## Logistic Ridge Lasso Elastic
## (Intercept) 22.43418789 0.0807000106 -0.064265430 0.163937829
## PC1 -0.39626955 -0.0050051000 -0.005493876 -0.007843627
## PC2 1.86303276 0.0472786826 0.081214148 0.072857719
## PC3 -0.63393983 -0.0210761896 -0.015886527 -0.030924611
## PC4 0.02548581 -0.0130267082 -0.002764961 -0.013623063
## PC5 -1.12707314 -0.0227188293 -0.007852772 -0.030039301
## PC6 -0.28673016 0.0058275721 0.000000000 0.004214186
## PC7 -0.44279441 -0.0062218648 0.000000000 0.000000000
## PC8 -1.30466583 0.0262947315 0.019832036 0.038120120
## PC9 -2.16086885 0.0048692039 0.000000000 0.003330861
## PC10 -2.70127503 0.0035641020 0.000000000 0.000000000
## PC11 -0.25186288 0.0145765829 0.000000000 0.013698308
## PC12 -1.07748631 -0.0206988332 -0.014512595 -0.035200651
## PC13 0.58367929 0.0009953818 0.000000000 0.000000000
## PC14 0.12956740 0.0012856369 0.000000000 0.000000000
## PC15 -1.24395008 -0.0145722034 0.000000000 -0.020981469
## PC16 -0.85417330 -0.0109033319 0.000000000 -0.014132494
## PC17 -0.98608767 -0.0136219024 0.000000000 -0.024680119
## PC18 0.75416674 0.0214241135 0.037650383 0.034321567
## PC19 -1.20328013 0.0023995327 0.000000000 0.000000000
## PC20 1.58504740 0.0150938176 0.000000000 0.021488752
## PC21 0.95167734 0.0187477007 0.007384289 0.031102593
## PC22 -1.84956931 -0.0128646772 0.000000000 -0.021373158
## PC23 2.54379599 0.0146114055 0.000000000 0.020300726
## PC24 -2.07546807 -0.0155809787 0.000000000 -0.025716174
## PC25 -0.33558080 0.0062881778 0.000000000 0.000000000
Use the PCA projections in Q1 to obtain the first 25 PCs of the 20 unknown samples. Use one method that performs well in Q4 to make predictions. Caret already used the hyper-parameters learned from cross-validation to train the parameters of each method on the full 100 training data. You just need to call this method to make the predictions.
Expression data for the 20 unknown samples are provided in unknown_samples.txt.
: I will use ElasticNet since it was the best-performing model. The predictions are listed below.
### Your code here
# convert test to PCA
# Source: https://stats.stackexchange.com/questions/2592/how-to-project-a-new-vector-onto-pca-space
test_df = scale(test_data, res.pca$center, res.pca$scale) %*% res.pca$rotation
# keep top 25 PC
test_df = test_df[,c(1:25)]
# make prediction
predictions = elastic %>% predict(test_df)
predictions
## [1] Tumor Normal Tumor Normal Tumor Normal Tumor Normal Tumor Normal
## [11] Tumor Normal Tumor Normal Tumor Normal Tumor Normal Tumor Tumor
## Levels: Normal Tumor
Can you find out the top 3 genes that are most important in this prediction method in Q6? Do they have some known cancer relevance?
: To find the top 3 genes most important in the ElasticNet model from Q6, I took the product of the absolute value of the rotation/loading matrix (for the top 25 PCs) and the absolute value of the ElasticNet coefficient vector, which measures each PC's importance in the model (absolute values are used per instruction on Slack). The genes corresponding to the largest entries of the resulting vector are the most important.
### Your code here
# loading matrix
loading_mtx = abs(as.matrix(res.pca$rotation[,c(1:25)]))
# vector of coefficient
impt_vector = abs(as.matrix(coef(elastic$finalModel, elastic$bestTune$lambda)[-1]))
# contribution vector
contributions = loading_mtx %*% impt_vector
head(contributions[order(contributions, decreasing = TRUE), ], 3)
## SLC30A8 C16orf70 NAT9
## 0.005782189 0.005710370 0.005595463
The top 3 genes most important in this prediction method are SLC30A8, C16orf70, and NAT9. Looking online, I found that NAT9 and SLC30A8 are associated with certain diseases, but I could not find much on C16orf70. In general, I could not find a clear link between these genes and cancer (this may be because I am not a biologist).
Suppose a pathologist later made diagnosis on the 20 unknown samples (load the diagnosis.txt file). Based on this gold standard, draw an ROC curve of your predictions in Q6. What is the prediction AUC?
### Your code here
# consolidated data frame of actual and predicted results
results = data.frame(sample = test_label$sample,
y = as.factor(test_label$phenotype),
ypred = as.factor(predictions))
# confusion matrix (caret expects the predictions first, then the reference)
confusionMatrix(data = results$ypred, reference = results$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Normal Tumor
## Normal 8 1
## Tumor 1 10
##
## Accuracy : 0.9
## 95% CI : (0.683, 0.9877)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 0.0009274
##
## Kappa : 0.798
##
## Mcnemar's Test P-Value : 1.0000000
##
## Sensitivity : 0.8889
## Specificity : 0.9091
## Pos Pred Value : 0.8889
## Neg Pred Value : 0.9091
## Prevalence : 0.4500
## Detection Rate : 0.4000
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.8990
##
## 'Positive' Class : Normal
##
I made a confusion matrix of the model's results: accuracy was 0.9, with only one false positive and one false negative.
# ROC curve
# predictions for ROC curve
prediction_for_roc_curve = predict(elastic, test_df, type="prob")
# specify the different tissue
classes = levels(as.factor(test_label$phenotype))
# adopted from lab 4
for (i in 1:2) {
true_values = ifelse(as.factor(test_label$phenotype)==classes[i],1,0)
pred = prediction(prediction_for_roc_curve[,i], true_values)
perf = performance(pred, "tpr", "fpr")
if (i==1) {
plot(perf, main="ROC Curve", col="green")
}
else {
plot(perf, main="ROC Curve", col="red", add=TRUE)
}
# Calculate the AUC and print it to screen
auc.perf = performance(pred, measure = "auc")
print(auc.perf@y.values)
}
## [[1]]
## [1] 0.9494949
##
## [[1]]
## [1] 0.9494949
# add a legend to the graph
legend("topright", inset=.05, title="Tissue Type",
c("Normal", "Tumor"), fill=c("green", "red"), horiz=TRUE)
abline(a=0, b=1, lty=2, col="blue")
The prediction AUC is 0.9494949 for both the Tumor and Normal curves (with two classes, the two ROC curves mirror each other and share the same AUC). Looking at the ROC graph, I would say our model is good: its curve sits well above the baseline (dashed blue line).